让非技术人员可以从数据库中提问,这是学术界和工业界多年来感兴趣的问题。最近,大型语言模型(LLM)技术(如GPT-4)的进展提高了所提出解决方案的准确性。然而,由于最先进的LLM尚未开放进行微调,因此最近在这一领域的研究集中在创建能够在不修改基础LLM的情况下实现复杂的自然语言到SQL(NL-to-SQL)场景的检索增强生成(RAG)算法。
上周,OpenAI开放了GPT-3.5-turbo供微调使用。在本文中,我们将微调自己的NL-to-SQL模型,并将其性能与最先进的RAG方法进行比较。我们将使用耶鲁大学的Spider数据集作为测试基准。
以下是来自Spider数据集的两个示例:样例。通过简单的处理,我们可以从这个数据集获得以下训练数据:
xxxxxxxxxx
101[
2 {
3 "query": "SELECT count(*) FROM singer",
4 "question": "有多少歌手?"
5 },
6 {
7 "query": "SELECT count(*) FROM singer",
8 "question": "歌手的总人数是多少?"
9 }
10]
对于微调GPT-3.5-Turbo,首先要做的是创建和上传训练数据集。由于GPT-3.5-Turbo是一个ChatModel,因此此数据集必须使用以下格式,并被上传为JSONL文件。
xxxxxxxxxx
31{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}
2{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}
3{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}
Spider数据集有一个包含2147个问题/SQL对的测试集,1034个问题/SQL对的开发集和7000个问题/SQL对的训练集。我们将从Spider训练集中构建我们的微调数据集,使用上述结构。
一个NL-to-SQL任务的定义如下:给定一个问题和数据库,确定一个SQL查询,当对数据库执行该查询时,返回一个结果集,可以回答这个问题。对于如何更好地提示LLMs执行此任务,研究人员探索了各种方法,并普遍认为提示需要包括指令组件、Data Schema的详细信息、关于数据内容、一组任务特定的演示以及实际的问题。
给定ChatModel训练数据的格式,上述元素必须在以下三个提示中呈现:
system_prompt – 包含指令、Data Schema和数据库内容
user_prompt – 包含自然语言问题
assistant_prompt – 包含SQL查询和推理步骤
让我们看看如何为我们的NL-to-SQL训练数据集创建每个提示。
创建system_prompt是这个任务中最复杂的部分。至少,system_prompt需要包括:
系统指令
Data Schema
数据内容
此外,对于任何实际的用例,如果有大量的表,则训练集中的样本还应该训练模型选择正确的表格用于SQL查询(即执行模式链接)。
对于指令,我们使用了以下标准提示:
xxxxxxxxxx
31You are an assistant that is an expert in generating Sqlite SQL queries.
2Having the access to database content, generate a correct Sqlite SQL query for the given question.
3### Database content ###
对于Data Schema,文献中有许多提出的格式,但没有明确的共识,哪种格式表现最佳。我们发现以下是Data Schema的最佳表示方式:
xxxxxxxxxx
201CREATE TABLE concert (
2"concert_ID" INTEGER NOT NULL,
3"concert_Name" TEXT NOT NULL,
4"Theme" TEXT,
5"Stadium_ID" TEXT NOT NULL,
6"Year" TEXT,
7PRIMARY KEY ("concert_ID"),
8FOREIGN KEY("Stadium_ID")
9REFERENCES stadium ("Stadium_ID")
10)
11CREATE TABLE singer (
12"Singer_ID" INTEGER NOT NULL,
13"Name" TEXT,
14"Country" TEXT NOT NULL,
15"Song_Name" TEXT NOT NULL,
16"Song_release_year" TEXT,
17"Age" INTEGER,
18"Is_male" BOOLEAN NOT NULL,
19PRIMARY KEY ("Singer_ID")
20)
经过多次实验,我们发现以下模板在最佳的训练模型数据内容方面表现最好:
xxxxxxxxxx
241/*
2Columns in concert and 3 examples in each column for the high cardinality columns :
3concert_ID: 1025 , 1101 , 1247
4concert_Name : "Fire", "Dance", "Sky"
5Stadium_ID : 9, 10, 11
6*/
7/*
8Columns in concert and all categories for the low cardinality columns :
9Theme : " ROCK ", " POP ", " HIP-HOP "
10Year : 2022, 2021, 2023, 2020
11*/
12/*
13Columns in singer and 3 examples in each column for the high cardinality columns :
14Singer_ID : 10235 , 110231 , 1242447
15Name : "Jordan", "Gabriel", "Tiffany"
16Country : "Iran", "India", "Canada"
17Song_Name : "dance in the fire", "rain", "sky"
18Age : 19, 20, 21
19*/
20/*
21Columns in singer and all categories for the low cardinality columns :
22Is_male : "MALE", "FEMALE",
23Song_release_year : 2022, 2021, 2023, 2020
24*/
数据库内容中的一个重要元素是如何识别分类(低基数)列。区分低基数和高基数列的阈值取决于要微调的大型语言模型(LLM)的上下文窗口大小。给定GPT-3.5-turbo的4096令牌上下文窗口,我们确定20个令牌是低基数和高基数列之间适当的阈值。
为了为我们训练集创建正确的system_prompt,需要以这样的方式提供样本,以便训练模型在数据库上正确执行模式链接。为此,我们采用了以下启发式方法:对于每个单独的NL <> SQL样本,我们除了正确的表之外,还随机选择了数据库中的其他表,直到达到4000个令牌的上下文窗口限制为止。为了减少位置信息的影响,我们进一步随机化了表的顺序。简而言之,每个system_prompt包括相关表的模式和内容与其他不相关的表混合在一起,帮助训练模型选择查询的正确表。
现在,我们将所有这些放在一起,构建我们的system_prompt。
对于来自Spider的以下示例:
xxxxxxxxxx
21question:"How many heads of the departments are older than 56?"
2SQL: "SELECT count(*) FROM head WHERE age > 56"
system_prompt将是
xxxxxxxxxx
791You are an assistant that is an expert in generating Sqlite SQL queries.
2Having the access to database content, generate a correct Sqlite SQL query for the given question.
3### Database content ###CREATE TABLE trip (
4id INTEGER, duration INTEGER,
5start_date TEXT,
6start_station_name TEXT,
7start_station_id INTEGER,
8end_date TEXT,
9end_station_name TEXT,
10end_station_id INTEGER,
11bike_id INTEGER,
12subscription_type TEXT,
13zip_code INTEGER,
14PRIMARY KEY (id)
15)
16/* Columns in trip and 3 examples in each column for high cardinality columns :
17id : 900645, 900752, 900524
18duration : 1131, 2146, 1155
19start_date : 8/21/2015 17:39, 8/21/2015 17:03, 8/21/2015 17:16
20start_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th
21start_station_id : 56, 65, 49
22end_date : 8/21/2015 17:19, 8/21/2015 18:08, 8/21/2015 17:32
23end_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th
24end_station_id : 56, 65, 49
25bike_id : 586, 56, 65
26zip_code : 94070, 94530, 94040–1724
27*/
28/* Columns in trip and all categories for low cardinality columns :
29subscription_type : Customer, Subscriber
30*/
31
32CREATE TABLE management (
33"department_ID" INTEGER,
34"head_ID" INTEGER,
35temporary_acting TEXT,
36PRIMARY KEY ("department_ID", "head_ID"),
37FOREIGN KEY("head_ID") REFERENCES head ("head_ID"),
38FOREIGN KEY("department_ID") REFERENCES department ("Department_ID")
39)
40/* Columns in management and all categories for low cardinality columns :
41department_ID : 7, 15, 2, 11
42head_ID : 5, 4, 6, 3, 10
43temporary_acting : Yes, No
44*/
45
46CREATE TABLE department (
47"Department_ID" INTEGER,
48"Name" TEXT,
49"Creation" TEXT,
50"Ranking" INTEGER,
51"Budget_in_Billions" REAL,
52"Num_Employees" REAL,
53PRIMARY KEY ("Department_ID")
54)
55/* Columns in department and 3 examples in each column for high cardinality columns :
56Department_ID : 1, 13, 11
57Name : Energy, Interior, Health and Human Services
58Creation : 1913, 1979, 1989
59Ranking : 1, 13, 11
60Budget_in_Billions : 10.7, 77.6, 59.7
61Num_Employees : 112557.0, 3000000.0, 235000.0
62*/
63
64...
65CREATE TABLE head (
66"head_ID" INTEGER,
67name TEXT,
68born_state TEXT,
69age REAL,
70PRIMARY KEY ("head_ID")
71)
72/* Columns in head and all categories for low cardinality columns :
73head_ID : 1, 2, 5, 7, 8, 4, 6, 3, 10, 9
74name : Jeff Maggert, Pádraig Harrington, Billy Mayfair, K. J. Choi, Dudley Hart, Sergio García, Stewart Cink, Tiger Woods, Nick Faldo, Franklin Langham
75born_state : Delaware, Connecticut, Alabama, California, Florida
76age : 69.0, 67.0, 68.0, 53.0, 56.0, 52.0, 50.0, 43.0
77*/
78
79...
用户提示很简单,每个Spider样本的用户问题。例如:
xxxxxxxxxx
11How many heads of the departments are older than 56?
助理提示也很简单,包含了来自Spider的相关SQL查询和找到正确列和正确表的推理步骤。为了构建推理步骤,我们只需提取在SQL查询中使用的表和列。例如:
xxxxxxxxxx
41To construct the query, I'll be working with the following tables: head.
2From these tables, I'll be using the following columns: age.
3The SQL query I'll be generating is:
4SELECT count(*) FROM head WHERE age > 56
xxxxxxxxxx
161{
2 "messages": [
3 {
4 "role": "system",
5 "content": "\nYou are an assistant that is an expert in generating sqlite SQL queries.\nHaving the access to database content, generate a correct sqlite SQL query for the given question.\n### Database content ###\n \nCREATE TABLE trip (\n\tid INTEGER, \n\tduration INTEGER, \n\tstart_date TEXT, \n\tstart_station_name TEXT, \n\tstart_station_id INTEGER, \n\tend_date TEXT, \n\tend_station_name TEXT, \n\tend_station_id INTEGER, \n\tbike_id INTEGER, \n\tsubscription_type TEXT, \n\tzip_code INTEGER, \n\tPRIMARY KEY (id)\n)\n/*\nColumns in trip and 3 examples in each column for high cardinality columns :\nid : 900645, 900752, 900524\nduration : 1131, 2146, 1155\nstart_date : 8/21/2015 17:39, 8/21/2015 17:03, 8/21/2015 17:16\nstart_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th\nstart_station_id : 56, 65, 49\nend_date : 8/21/2015 17:19, 8/21/2015 18:08, 8/21/2015 17:32\nend_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th\nend_station_id : 56, 65, 49\nbike_id : 586, 56, 65\nzip_code : 94070, 94530, 94040-1724\n*/\n/*\nColumns in trip and all categories for low cardinality columns :\nsubscription_type : Customer, Subscriber\n*/\n \nCREATE TABLE \"Problems\" (\n\tproblem_id INTEGER, \n\tproduct_id INTEGER NOT NULL, \n\tclosure_authorised_by_staff_id INTEGER NOT NULL, \n\treported_by_staff_id INTEGER NOT NULL, \n\tdate_problem_reported DATETIME NOT NULL, \n\tdate_problem_closed DATETIME, \n\tproblem_description VARCHAR(255), \n\tother_problem_details VARCHAR(255), \n\tPRIMARY KEY (problem_id), \n\tFOREIGN KEY(reported_by_staff_id) REFERENCES \"Staff\" (staff_id), \n\tFOREIGN KEY(product_id) REFERENCES \"Product\" (product_id), \n\tFOREIGN KEY(closure_authorised_by_staff_id) REFERENCES \"Staff\" (staff_id)\n)\n/*\nColumns in Problems and 3 examples in each column for high cardinality columns :\nproblem_id : 1, 13, 11\nclosure_authorised_by_staff_id : 1, 13, 2\ndate_problem_reported : 1995-05-14 08:32:56, 1988-11-07 16:09:31, 1986-11-13 07:30:55\ndate_problem_closed : 1974-09-20 13:42:19, 1997-10-18 20:09:57, 2004-06-20 01:08:25\nproblem_description : d, i, s\n*/\n/*\nColumns in Problems and all categories for low cardinality columns :\nproduct_id : 1, 13, 2, 5, 7, 8, 4, 6, 15\nreported_by_staff_id : 1, 13, 11, 2, 5, 7, 4, 14, 10\nother_problem_details : f, m, i, s, k, l, p, v, c\n*/\n \nCREATE TABLE management (\n\t\"department_ID\" INTEGER, \n\t\"head_ID\" INTEGER, \n\ttemporary_acting TEXT, \n\tPRIMARY KEY (\"department_ID\", \"head_ID\"), \n\tFOREIGN KEY(\"head_ID\") REFERENCES head (\"head_ID\"), \n\tFOREIGN KEY(\"department_ID\") REFERENCES department (\"Department_ID\")\n)\n/*\nColumns in management and all categories for low cardinality columns :\ndepartment_ID : 7, 15, 2, 11\nhead_ID : 5, 4, 6, 3, 10\ntemporary_acting : Yes, No\n*/\n \nCREATE TABLE category (\n\tcategory_id INTEGER NOT NULL, \n\tname VARCHAR(25) NOT NULL, \n\tlast_update TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL, \n\tPRIMARY KEY (category_id)\n)\n/*\nColumns in category and 3 examples in each column for high cardinality columns :\ncategory_id : 1, 16, 13\nname : Family, Sci-Fi, Action\n*/\n/*\nColumns in category and all categories for low cardinality columns :\nlast_update : 2006-02-15 04:46:27\n*/\n \nCREATE TABLE ship (\n\t\"Ship_ID\" INTEGER, \n\t\"Name\" TEXT, \n\t\"Type\" TEXT, \n\t\"Nationality\" TEXT, \n\t\"Tonnage\" INTEGER, \n\tPRIMARY KEY (\"Ship_ID\")\n)\n/*\nColumns in ship and all categories for low cardinality columns :\nShip_ID : 1, 2, 5, 7, 8, 4, 6, 3\nName : Clan McTavish, Farringford, Appam, Author, Dromonby, Corbridge, Trader, Ariadne\nType : Battle ship, Cargo ship\nNationality : United States, United Kingdom\nTonnage : 3035, 3146, 7781, 3496, 3687, 5816, 3627, 3608\n*/\n \nCREATE TABLE member_attendance (\n\t\"Member_ID\" INTEGER, \n\t\"Performance_ID\" INTEGER, \n\t\"Num_of_Pieces\" INTEGER, \n\tPRIMARY KEY (\"Member_ID\", \"Performance_ID\"), \n\tFOREIGN KEY(\"Performance_ID\") REFERENCES performance (\"Performance_ID\"), \n\tFOREIGN KEY(\"Member_ID\") REFERENCES member (\"Member_ID\")\n)\n/*\nColumns in member_attendance and all categories for low cardinality columns :\nMember_ID : 1, 11, 2, 5, 7, 4, 3\nPerformance_ID : 1, 2, 4, 6, 3\nNum_of_Pieces : 1, 2, 4, 3\n*/\n \nCREATE TABLE department (\n\t\"Department_ID\" INTEGER, \n\t\"Name\" TEXT, \n\t\"Creation\" TEXT, \n\t\"Ranking\" INTEGER, \n\t\"Budget_in_Billions\" REAL, \n\t\"Num_Employees\" REAL, \n\tPRIMARY KEY (\"Department_ID\")\n)\n/*\nColumns in department and 3 examples in each column for high cardinality columns :\nDepartment_ID : 1, 13, 11\nName : Energy, Interior, Health and Human Services\nCreation : 1913, 1979, 1989\nRanking : 1, 13, 11\nBudget_in_Billions : 10.7, 77.6, 59.7\nNum_Employees : 112557.0, 3000000.0, 235000.0\n*/\n\n \nCREATE TABLE chip_model (\n\t\"Model_name\" TEXT, \n\t\"Launch_year\" REAL, \n\t\"RAM_MiB\" REAL, \n\t\"ROM_MiB\" REAL, \n\t\"Slots\" TEXT, \n\t\"WiFi\" TEXT, \n\t\"Bluetooth\" TEXT, \n\tPRIMARY KEY (\"Model_name\")\n)\n/*\nColumns in chip_model and 3 examples in each column for high cardinality columns :\nModel_name : X30 mid-range, X50 Advanced, X51 mid-range\n*/\n/*\nColumns in chip_model and all categories for low cardinality columns :\nLaunch_year : 2002.0, 2005.0, 2004.0, 2003.0\nRAM_MiB : 32.0, 64.0\nROM_MiB : 48.0, 256.0, 128.0, 32.0, 64.0\nSlots : 1CFII,1SD, 1SD\nWiFi : 802.11b, No\nBluetooth : 1.2, Yes, No, 1.1\n*/\n \nCREATE TABLE head (\n\t\"head_ID\" INTEGER, \n\tname TEXT, \n\tborn_state TEXT, \n\tage REAL, \n\tPRIMARY KEY (\"head_ID\")\n)\n/*\nColumns in head and all categories for low cardinality columns :\nhead_ID : 1, 2, 5, 7, 8, 4, 6, 3, 10, 9\nname : Jeff Maggert, Pádraig Harrington, Billy Mayfair, K. J. Choi, Dudley Hart, Sergio García, Stewart Cink, Tiger Woods, Nick Faldo, Franklin Langham\nborn_state : Delaware, Connecticut, Alabama, California, Florida\nage : 69.0, 67.0, 68.0, 53.0, 56.0, 52.0, 50.0, 43.0\n*/\n \nCREATE TABLE mountain (\n\t\"Mountain_ID\" INTEGER, \n\t\"Name\" TEXT, \n\t\"Height\" REAL, \n\t\"Prominence\" REAL, \n\t\"Range\" TEXT, \n\t\"Country\" TEXT, \n\tPRIMARY KEY (\"Mountain_ID\")\n)\n/*\nColumns in mountain and all categories for low cardinality columns :\nMountain_ID : 1, 2, 5, 7, 4, 6, 3\nName : Ngaliema / Mt Stanley (Margherita Pk), Mount Kenya (Lenana), Kibo (Uhuru Pk), Ngaliema / Mt Stanley (Savoia Pk), Mount Kenya (Batian), Duwoni / Mt Speke (Vittorio Emanuele Pk), Mawenzi (Hans Meyer Pk)\nHeight : 5109.0, 5199.0, 5895.0, 4890.0, 4985.0, 4977.0, 5148.0\nProminence : 720.0, 850.0, 3951.0, 3825.0, 130.0, 5885.0, 110.0\nRange : Kilimanjaro, Mount Kenya, Rwenzori\nCountry : DR Congo Uganda, Uganda, Tanzania, Kenya\n*/\n \nCREATE TABLE \"Restaurant_Type\" (\n\t\"ResTypeID\" INTEGER, \n\t\"ResTypeName\" VARCHAR(40), \n\t\"ResTypeDescription\" VARCHAR(100), \n\tPRIMARY KEY (\"ResTypeID\")\n)\n/*\nColumns in Restaurant_Type and all categories for low cardinality columns :\nResTypeID : 1, 2\nResTypeName : Sandwich, Stir-fry\nResTypeDescription : Classic Chinese cooking., Simplest there is.\n*/\n \nCREATE TABLE farm_competition (\n\t\"Competition_ID\" INTEGER, \n\t\"Year\" INTEGER, \n\t\"Theme\" TEXT, \n\t\"Host_city_ID\" INTEGER, \n\t\"Hosts\" TEXT, \n\tPRIMARY KEY (\"Competition_ID\"), \n\tFOREIGN KEY(\"Host_city_ID\") REFERENCES city (\"City_ID\")\n)\n/*\nColumns in farm_competition and all categories for low cardinality columns :\nCompetition_ID : 1, 2, 5, 4, 6, 3\nYear : 2004, 2013, 2005, 2006, 2003, 2002\nTheme : MTV Cube, Valentine's Day, Codehunters, Carnival M is back!, Aliens, MTV Asia Aid\nHost_city_ID : 1, 2, 5, 4, 3\nHosts : Mandy Moore and Ronan Keating, Alicia Keys, Shaggy and Coco Lee, Leehom Wang and Kelly Rowland, Miley Cyrus Jared Leto and Karen Mok, Vanness Wu and Michelle Branch\n*/\n \nCREATE TABLE \"Country\" (\n\tid INTEGER, \n\tname TEXT, \n\tPRIMARY KEY (id)\n)\n/*\nColumns in Country and 3 examples in each column for high cardinality columns :\nid : 1, 19694, 7809\nname : Scotland, Italy, Spain\n*/\n\n \nCREATE TABLE artist (\n\tartist_name TEXT(50) NOT NULL, \n\tcountry TEXT(20), \n\tgender TEXT(20), \n\tpreferred_genre TEXT(50), \n\tCONSTRAINT a_name PRIMARY KEY (artist_name), \n\tFOREIGN KEY(preferred_genre) REFERENCES genre (g_name) ON DELETE CASCADE\n)\n/*\nColumns in artist and all categories for low cardinality columns :\nartist_name : Prity, Michel, Topu, Shrikanta, Enrique, Farida\ncountry : India, UK, USA, Bangladesh\ngender : Male, Female\npreferred_genre : tagore, folk, modern, nazrul, blues, pop\n*/\n \nCREATE TABLE \"Organizations\" (\n\torganization_id INTEGER NOT NULL, \n\tparent_organization_id INTEGER, \n\torganization_details VARCHAR(255), \n\tPRIMARY KEY (organization_id)\n)\n/*\nColumns in Organizations and all categories for low cardinality columns :\norganization_id : 7, 8, 10\nparent_organization_id : 7, 8\norganization_details : Denesik and Sons Party, Reinger, Hudson and Nolan Group, Robel-Schulist Group\n*/\n \nCREATE TABLE school (\n\t\"School_ID\" INTEGER, \n\t\"School\" TEXT, \n\t\"Location\" TEXT, \n\t\"Enrollment\" REAL, \n\t\"Founded\" REAL, \n\t\"Denomination\" TEXT, \n\t\"Boys_or_Girls\" TEXT, \n\t\"Day_or_Boarding\" TEXT, \n\t\"Year_Entered_Competition\" REAL, \n\t\"School_Colors\" TEXT, \n\tPRIMARY KEY (\"School_ID\")\n)\n/*\nColumns in school and all categories for low cardinality columns :\nSchool_ID : 1, 2, 5, 4, 6, 3\nSchool : St Aloysius' College, Cranbrook School, Waverley College, Knox Grammar School, Barker College, Trinity Grammar School\nLocation : Hornsby, Summer Hill, Waverley, Bellevue Hill, Milsons Point, Wahroonga\nEnrollment : 1000.0, 1850.0, 2200.0, 1200.0, 2300.0, 1430.0\nFounded : 1918.0, 1924.0, 1913.0, 1879.0, 1903.0, 1890.0\nDenomination : Catholic, Uniting Church, Anglican\nBoys_or_Girls : Boys only to Yr 9 Co-ed Year 10 to 12, Boys\nDay_or_Boarding : Day, Day & Boarding\nYear_Entered_Competition : 1944.0, 1929.0\nSchool_Colors : Royal Blue and Gold, Black & Blue, Red, White & Blue, Red & Blue, Green and White\n*/\n \nCREATE TABLE flight (\n\tid INTEGER, \n\t\"Vehicle_Flight_number\" TEXT, \n\t\"Date\" TEXT, \n\t\"Pilot\" TEXT, \n\t\"Velocity\" REAL, \n\t\"Altitude\" REAL, \n\tairport_id INTEGER, \n\tcompany_id INTEGER, \n\tPRIMARY KEY (id), \n\tFOREIGN KEY(company_id) REFERENCES operate_company (id), \n\tFOREIGN KEY(airport_id) REFERENCES airport (id)\n)\n/*\nColumns in flight and 3 examples in each column for high cardinality columns :\nid : 1, 13, 11\nVehicle_Flight_number : M2-F1 #14, M2-F1 #61, M2-F1 #0\nDate : July 16, 1965, May 19, 1964, March 28, 1966\n*/\n/*\nColumns in flight and all categories for low cardinality columns :\nPilot : Thompson, Peterson\nVelocity : 240.0, 135.0\nAltitude : 3650.0, 0.0\nairport_id : 1, 2, 5, 8, 4, 6, 3, 9\ncompany_id : 1, 13, 11, 2, 5, 7, 4, 6, 3, 9\n*/\n \nCREATE TABLE \"Type_Of_Restaurant\" (\n\t\"ResID\" INTEGER, \n\t\"ResTypeID\" INTEGER, \n\tFOREIGN KEY(\"ResID\") REFERENCES \"Restaurant\" (\"ResID\"), \n\tFOREIGN KEY(\"ResTypeID\") REFERENCES \"Restaurant_Type\" (\"ResTypeID\")\n)\n/*\nColumns in Type_Of_Restaurant and all categories for low cardinality columns :\nResID : 1, 2\nResTypeID : 1, 2\n*/\n \nCREATE TABLE journalist (\n\t\"journalist_ID\" INTEGER, \n\t\"Name\" TEXT, \n\t\"Nationality\" TEXT, \n\t\"Age\" TEXT, \n\t\"Years_working\" INTEGER, \n\tPRIMARY KEY (\"journalist_ID\")\n)\n/*\nColumns in journalist and 3 examples in each column for high cardinality columns :\njournalist_ID : 1, 11, 2\nName : Jack Meaney, Herbert Swindells, Jackie Waring\n*/\n/*\nColumns in journalist and all categories for low cardinality columns :\nNationality : Northern Ireland, Wales, England\nAge : 37, 28, 25, 33, 34, 43, 27, 29\nYears_working : 1, 5, 7, 8, 21, 6, 3, 12, 10, 9\n*/\n \nCREATE TABLE list (\n\t\"LastName\" TEXT, \n\t\"FirstName\" TEXT, \n\t\"Grade\" INTEGER, \n\t\"Classroom\" INTEGER, \n\tPRIMARY KEY (\"LastName\", \"FirstName\")\n)\n/*\nColumns in list and 3 examples in each column for high cardinality columns :\nLastName : HOUTCHENS, GELL, FLACHS\nFirstName : RAY, EMILE, PATRINA\nClassroom : 109, 110, 106\n*/\n/*\nColumns in list and all categories for low cardinality columns :\nGrade : 1, 2, 5, 4, 6, 3, 0\n*/\n"
6 },
7 {
8 "role": "user",
9 "content": "How many heads of the departments are older than 56 ?"
10 },
11 {
12 "role": "assistant",
13 "content": "\nTo construct the query, I'll be working with the following tables: head.\nFrom these tables, I'll be using the following columns: age.\nThe SQL query I'll be generating is:\nSELECT count(*) FROM head WHERE age > 56"
14 }
15 ]
16}
一旦我们创建了JSONL文件(可以在这里或ipfs_here找到一个小样本),下一步是使用以下命令将创建的文件上传到OpenAI:
xxxxxxxxxx
21openai.api_key = os.getenv("OPENAI_API_KEY")
2print(openai.File.create(file=open("spider-finetuning.jsonl", "rb"),purpose='fine-tune'))
上传文件后,可以使用以下命令检查上传的状态:
xxxxxxxxxx
31print(openai.File.retrieve(id="file-id"))
2# 或者
3print(openai.File.list())
结果应该类似于:
xxxxxxxxxx
101{
2"object": "file",
3"id": "file-id",
4"purpose": "fine-tune",
5"filename": "file",
6"bytes": 71699079,
7"created_at": 1693343752,
8"status": "uploaded",
9"status_details": null
10}
当状态变为已处理时(类似于下面的示例),您可以将文件用于微调:
xxxxxxxxxx
101{
2"object": "file",
3"id": "file-id",
4"purpose": "fine-tune",
5"filename": "file",
6"bytes": 71699079,
7"created_at": 1693343752,
8"status": "processed",
9"status_details": null
10}
现在,我们准备开始微调作业。可以使用以下python代码创建一个微调作业:
xxxxxxxxxx
81print(openai.FineTuningJob.create(
2training_file="file-id",
3model="gpt-3.5-turbo",
4suffix = "spider",
5hyperparameters = {
6"n_epochs": #number_of_epochs,
7})
8)
微调过程的持续时间将根据微调数据集的大小而有所不同。微调有一个最大令牌限制,设置为50000000个令牌。因此,在使用Spider数据集时,我们将样本数量从7000个减少到5750个,并进行总共2个时期的微调。
您可以使用以下命令检查微调作业的状态:
xxxxxxxxxx
11print(openai.FineTuningJob.retrieve(id="ftjob-id"))
结果应类似于:
xxxxxxxxxx
191{
2"object": "fine_tuning.job",
3"id": "ftjob-id",
4"model": "gpt-3.5-turbo-0613",
5"created_at": 1693346245,
6"finished_at": 1693353313,
7"fine_tuned_model": "ft:gpt-3.5-turbo-0613:dataherald:spider:id",
8"organization_id": "org-id",
9"result_files": [
10"file-id"
11],
12"status": "succeeded",
13"validation_file": null,
14"training_file": "file-id",
15"hyperparameters": {
16"n_epochs": 2
17},
18"trained_tokens": 44722020
19}
DIN-SQL是一种用于自然语言到SQL转换的模型,它在处理Spider数据集上取得了最新的研究成果。DIN-SQL代表"Denoising-inductive SQL generation",该模型结合了去噪自编码器和归纳式学习的方法,使用GPT-4作为生成器模型来生成SQL查询语句。DIN-SQL使用了各种高级提示技术,包括少量示例提示、思路链提示和分解提示,以提高其性能和准确度。该模型具有较高的准确性和效率,但在成本和处理时间方面可能较高。
我们将微调模型的性能与未经微调的GPT3.5-Turbo和DIN-SQL + GPT-4(Spider的当前最先进方法)进行了基准测试,以获得零-shot性能。
结果如下
微调的G-3.5-Turbo的性能与之前的方法相比进的提示技术,包括少量提示、思维链提示和分解提示)保持一致,这是当前最先进的方法。
关键是,与DIN-SQL + GPT-4方法相比,微调模型显著降低了成本和处理时间。下表提供了每个来自Spider基准的问题的不同模型之间的大致成本和速度。
不同模型每个问题的成本和速度(来自Spider基准)
如上所示,与DIN-SQL与GPT-4相比,微调的GPT-3.5-Turbo模型的成本降低了30倍,速度提高了12倍。
通过投入时间和金钱来构建训练数据集,可以在准确性上与最先进的方法相匹配,同时速度提高12倍,成本降低30倍。如果,针对具体的业务做针对性的微调准确性应该可以进一步提高。
参考:
https://medium.com/dataherald/fine-tuning-gpt-3-5-turbo-for-natural-language-to-sql-4445c1d37f7c